Data Visualization Project 2: Analysis of the World Happiness Dataset
1 Introduction and Background
1.1 About the World Happiness Report
The World Happiness Report (WHR) reflects a worldwide evaluation of happiness and well-being across 170 countries (World Happiness Report). The WHR measures the state of happiness on a yearly basis and emphasizes the need to incorporate well-being factors into government policy making.
The report is a collaboration of Gallup, the Oxford Wellbeing Research Centre, the United Nations Sustainable Development Solutions Network and the WHR’s Editorial Board. The first report was released in 2012.
1.1.1 Understanding the Happiness Score
Happiness scores are based on survey responses from the Gallup World Poll (GWP) (World Happiness Report). The evaluation primarily stems from the answers to the survey questions and from the Cantril Ladder score. The poll comprises more than 100 questions, including global questions and region-specific topics, covering several well-being areas: law and order, good jobs, well-being, food and shelter, institutions and infrastructure, and brain gain (Wikipedia 2024). Gallup typically surveys a sample of about 1,000 people in each country annually.
The Cantril Ladder score ranges from 0 to 10, where 0 is the worst and 10 the best well-being condition. Respondents are asked to rate the current state of their well-being on this scale. The researchers of the WHR average the scores of the last 3 years and use the result as the happiness score for the current year, in order to even out fluctuations caused by regional events such as changes in government (World Happiness Report 2024). This average ladder score is the overall national happiness score. The term “life evaluation” is sometimes used interchangeably with happiness score.
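The three-year averaging described above can be sketched in R as follows. The ladder values, the data frame and its column names are made up for illustration and are not taken from the WHR data.

```r
# Hypothetical yearly national average ladder scores for one country
# (illustrative values only, not real WHR data).
ladder <- data.frame(
  year  = c(2020, 2021, 2022, 2023),
  score = c(6.5, 6.9, 7.1, 7.0)
)

# The score reported for a given year is the mean of the three preceding years.
report_year <- 2024
in_window <- ladder$year >= report_year - 3 & ladder$year <= report_year - 1
happiness_score <- mean(ladder$score[in_window])
print(happiness_score)
```

Averaging over a window means that a single unusual year moves the reported score by only a third of its own deviation.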
1.2 Six Explanatory Variables
In Figure 1, which is Figure 2.1 of the World Happiness Report 2024, the happiness score or life evaluation is indicated for each country in the entire dataset.
Six explanatory variables are derived from the poll answers, and the associations of these variables with the happiness score help explain the variations across different countries. Those variables are Gross Domestic Product (GDP) per capita, healthy life expectancy, social support, freedom to make life choices, perceptions of corruption and generosity. Another variable shown in the plot is called “Dystopia + residual”. Definitions of these variables will be explained in the Data section (Section 3) of this report.
According to the report, the sub-bars in Figure 1, which represent the explanatory variables, do not influence the total score reported for each country (World Happiness Report 2024). They merely indicate how much of the country’s overall score is explained by each of the six explanatory variables (World Happiness Report). The width of each sub-bar is obtained by multiplying the country’s average value of that variable over the last 3 years by the coefficient estimated for the variable in the report’s model. For the 2024 report, the national averages for 2021-2023 are used to compute the sub-bars for each variable.
2 Software and Tools
For the analysis, the RStudio Desktop environment was used. The following packages were used for data manipulation, visualization, and analysis:
- here
- tidyverse
- GGally
- grid
- randomForest
- corrplot
- readxl
- cowplot
- rnaturalearth
- plotly
Functions from these packages were used to conduct the analysis, with the following functions being particularly important:
- randomForest (from the randomForest package) implements Breiman’s random forest algorithm for classification and regression. Random subsets of the data are used when training a random forest model, and the fitted model can be used to determine which features are important.
3 Data
3.1 World Happiness Report Data 2015-2023
The dataset was obtained from the Kaggle datasets collection and includes the World Happiness Report data between 2015 and 2023. It contains the country name, region name, national happiness score, and the explanatory factors (Islam 2023). Those explanatory factors are GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption. The variables in the dataset represent the socio-economic and well-being status of the individuals who participated in the Gallup World Poll during that period. This is the main dataset used in this report.
3.2 World Happiness Report 2024 Chapter 2 Appendix
The dataset for the Chapter 2 Appendix comes from the survey responses of the Gallup World Poll between 2005 and 2023. The dataset contains identifying information, namely country name and region name, and numerical information for GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity and perceptions of corruption. In contrast to the previous dataset, the GDP per capita and healthy life expectancy columns are given as true averages and are not adjusted for the life evaluation score. That is why this dataset is used for visualizing some of the variable distributions in Section 4.
3.3 Variable Definitions
Happiness score is a measure of subjective well-being (SWB) in the Gallup World Poll (GWP) covering years from 2005/06 to 2023. Unless stated otherwise, it is the national average response to the question of life evaluations and for the WHR it is averaged over 3 years to determine the value for the next year. The question in the poll that directly measures the score is “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”
This measure is also referred to as the Cantril life ladder.
GDP per capita (GDP stands for gross domestic product) is measured in purchasing power parity (PPP) terms at constant 2017 international dollar prices, taken from the World Development Indicators. When the latest GDP per capita is not available as of a certain date of the year, country-specific forecasts of real GDP growth from the Economic Outlook are used for the average calculation. If they are unavailable from the Economic Outlook, forecasts from the World Bank’s Global Economic Prospects are used.
Healthy Life Expectancy (HLE) at birth is based on the Global Health Observatory data by the World Health Organization (WHO).
Social support represents the national average of the ability to rely on someone in times of trouble. Individual responses are binary (either 0 or 1) and are collected in the GWP from the question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”.
Freedom represents the ability to make life decisions and is measured by the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”.
Generosity is the national average of answers to the GWP question “Have you donated money to a charity in the past month?”.
Corruption is a perception measure based on the following two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”.
The following variables are not included in the dataset used here but are explained for completeness:
Positive affect is defined as the average of three positive affect measures in the GWP: laughter, enjoyment and doing interesting things.
These measures are the responses to the following three questions, respectively:
“Did you smile or laugh a lot yesterday?”
“Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Enjoyment?”
“Did you learn or do something interesting yesterday?”
Negative affect is defined as the average of three negative affect measures in the GWP. They are worry, sadness and anger.
These measures are the respective responses to the following questions:
“Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Worry?”
“Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Sadness?”
“Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Anger?”
Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive (or zero, in six instances) width. The lowest scores observed for the six key variables, therefore, characterize Dystopia.
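As a minimal sketch of this construction, the following R code builds a Dystopia benchmark from the column minima of a small table. The country labels and values are hypothetical, and only three of the six key variables are shown.

```r
# Hypothetical values of three key variables for four countries
# (illustrative numbers only, not real WHR data).
factors <- data.frame(
  gdp     = c(1.2, 0.4, 0.9, 0.6),
  support = c(1.1, 0.3, 0.8, 0.5),
  freedom = c(0.6, 0.3, 0.1, 0.2),
  row.names = c("A", "B", "C", "D")
)

# Dystopia takes the lowest observed value of every variable.
dystopia <- sapply(factors, min)

# Distance of each country from Dystopia, per variable.
# By construction no entry is negative, and each column contains a zero.
above_dystopia <- sweep(as.matrix(factors), 2, dystopia, "-")
print(dystopia)
print(above_dystopia)
```

Because every value is measured relative to the column minimum, each country compares favorably or equally with Dystopia on every variable.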
4 Variable Distributions
4.1 Distributions of happiness score and the explanatory variables
Figure 2 shows the distributions of the variables. For all of the plots except a, b and i, the original data from the Gallup World Poll (see Section 3.2) is used in order to show the true distributions and not the scaled ones that were calculated for the World Happiness Report.
Most of the distributions are skewed towards one side. GDP per capita, social support, healthy life expectancy and perceptions of corruption are left-skewed: the peak of the density curve is located on the right side of the plot, but there is a long tail towards lower values. Generosity’s distribution, on the other hand, appears to be right-skewed, with a longer tail towards high values than towards low values. Its peak lies close to 0, on the negative side. The happiness score appears to be roughly normally distributed, with a mean of 5.44 on the answer scale from 0 to 10.
For region, the African and European regions make up most of the dataset. There is one small “Africa” bar, which is explained by the Somaliland Region being assigned to Africa when it first appeared in the World Happiness Report. This region functions like an independent country but officially belongs to Somalia, which makes it a corner case for assigning country and region.
The distribution of countries that participated per year shows that from 2019 onwards there has been a notable decline in the number of countries that took part in the World Happiness Report.
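The direction of skew can also be checked numerically. The sketch below computes the sample skewness as the third standardized moment, which is one common definition and an assumption here, on simulated data rather than on the report's variables. Negative values indicate a left-skewed distribution, positive values a right-skewed one.

```r
# Sample skewness as the third standardized moment (base R only).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
left_skewed  <- rbeta(1000, 5, 1)  # peak on the right, long tail to the left
right_skewed <- rbeta(1000, 1, 5)  # peak on the left, long tail to the right

print(skewness(left_skewed))   # negative for a left-skewed sample
print(skewness(right_skewed))  # positive for a right-skewed sample
```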
4.2 Distribution of the Happiness Score around the globe
Figure 3 is a heatmap visualizing the happiness score around the globe based on the latest available data. The distribution of happiness scores worldwide shows clear regional patterns. Happiness scores tend to be higher in North America, Western Europe and Oceania, as indicated by the lighter colors in these regions. Countries such as Finland, which is known as the happiest country in the world with a happiness score of 7.8, are located in these regions and have the highest happiness scores. This is particularly visible in Scandinavia and other parts of Northern Europe.
In stark contrast are regions such as Sub-Saharan Africa and South Asia, where happiness scores are considerably lower, as indicated by the darker colors on the map. Afghanistan, located in South Asia, has the lowest happiness score at 1.9, which is also visible on the map through its darker color.
Other interesting patterns can be observed in Latin America and Eastern Europe, where the happiness scores are medium to high. In these regions, the scores are more variable, indicating a mix of countries with different socio-economic conditions.
Overall, the map shows that the highest happiness scores are mainly found in wealthy and developed countries, while the lowest scores occur in regions with economic and social challenges. This global distribution of happiness scores illustrates the strong regional differences in perceived well-being and suggests the significance of socio-economic factors for individual happiness.
5 Exploring the relationship between Happiness Score and its numerical explanatory variables
First, the relationship between the happiness score and the other variables is explored through plots. Afterwards, a K-means model is used to look into a possible clustering of the data and how the resulting clusters can be interpreted. Lastly, we look into how a random forest model rates the importance of the variables when predicting the happiness score.
In this section the influence of the explanatory variables and the regions on the happiness score is explored. The box plots of the happiness score over different regions highlight the differences across the globe that are already recognizable in Section 4.2. In order to find out what makes humans most happy, the correlations between the variables and the happiness score can give some indication. This is emphasized by scatter plots, which also show the different regions’ influences on these correlations.
5.1 Influence of region on the happiness score
The happiness scores of the regions differ considerably in both level and spread, as shown in Figure 4. Western Europe as well as North America and ANZ have the highest median happiness scores. In contrast, Sub-Saharan Africa and South Asia have the lowest median happiness scores. The regions also differ strongly in their variance. The interquartile range (the range between the first and third quartiles) is larger in the Middle East and North Africa, while for Africa and for North America and ANZ the boxes are very small.
Region is used as an aggregation of the variable country, because the number of countries covered is too large to visualize each one individually. Countries that show interesting patterns are examined later in the report.
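The quantities read off a box plot can also be computed directly. The following sketch uses two hypothetical regions with made-up scores to show how the median and the interquartile range per region might be obtained in base R.

```r
# Hypothetical happiness scores for two regions (illustrative values only).
scores <- data.frame(
  region = rep(c("Region A", "Region B"), each = 5),
  score  = c(4.0, 5.0, 6.0, 7.0, 8.0,
             6.8, 6.9, 7.0, 7.1, 7.2)
)

# Median and interquartile range (third quartile minus first quartile) per region.
region_median <- tapply(scores$score, scores$region, median)
region_iqr    <- tapply(scores$score, scores$region, IQR)
print(region_median)
print(region_iqr)
```

A region with a wide box, such as the hypothetical Region A, has a large interquartile range, while a region with a narrow box has scores that are tightly concentrated around the median.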
5.2 Influence of the explanatory variables
Figure 5 shows the relationship between the happiness score and the explanatory variables. The colors of the dots represent the region to which the corresponding country belongs. Subplot a shows the relationship between GDP per capita and the happiness score. There seems to be a positive linear relationship between the variables, as the happiness score rises with higher GDP. The relationship is stronger than it seems from the plot, because the scales of the x- and y-axes differ greatly. The correlation of these variables, which can be seen in Figure 6, is very high with a value of 0.72, which supports the linear relationship between the variables. From the colors of the dots we can see that Sub-Saharan Africa and South Asia have both low GDP per capita and low happiness scores, while Western Europe has the highest happiness scores and also some of the highest GDP per capita values, followed by North America and ANZ.
Similar plots result from the relationships of Social Support (subplot b), Healthy Life Expectancy (subplot c) and Freedom to make Life Choices (subplot d) with the happiness score. In each case there is a positive linear relationship with the happiness score. The correlations are not as strong as for GDP per capita but still substantial, with values of 0.68 for Healthy Life Expectancy, followed by 0.65 for Social Support and 0.57 for Freedom to make Life Choices. The regional influences are also similar for these plots. Interestingly, some values are missing for healthy life expectancy.
For generosity (subplot e) there is no visible connection between the variables. It is notable that there are three outlier countries from Southeast Asia with high generosity values and a below-average happiness score.
The plot of perceptions of corruption (subplot f) shows a slight upward trend in happiness score for higher perceptions of corruption, but only a small fraction of data points score high on perceptions of corruption. There is also missing data for this variable. The correlation plot shows that there is again a positive correlation of 0.41 between the two variables, although it is weak compared to the other variables considered.
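The correlations discussed in this section are Pearson correlations between pairs of numeric columns. The sketch below illustrates the computation on simulated data. The variable names mirror the report, but the values are generated at random and the missing entries are inserted artificially to show how pairwise deletion handles gaps like those noted above.

```r
set.seed(42)
n <- 200

# Simulated data: happiness depends on gdp but not on generosity
# (hypothetical values, not the WHR dataset).
gdp        <- rnorm(n)
happiness  <- 0.7 * gdp + rnorm(n, sd = 0.7)
generosity <- rnorm(n)
sim <- data.frame(happiness, gdp, generosity)

# Introduce a few artificial missing values.
sim$gdp[1:5] <- NA

# Pearson correlation matrix; pairwise deletion keeps all rows that are
# complete for the two variables being compared.
cor_matrix <- cor(sim, use = "pairwise.complete.obs")
print(round(cor_matrix["happiness", ], 2))
```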
5.3 Influence of the Year
The yearly medians show a slight upward trend over the years, with 2020 to 2022 having little variance compared to the other years. There are also negative outliers for 2021 to 2023.
6 Top and Bottom 10 Ranking of Countries in Predictor Variables
7 Top and Bottom 10 Ranking of Countries for Happiness Score
8 Using K-means Clustering to find interesting patterns
The goal is to use K-means clustering to find a pattern that is interpretable and gives further context to the relationship between the happiness score and the other variables. The procedure is introduced first. Then the data is scaled for optimal performance, and the best number of clusters is chosen using the elbow criterion to sensibly minimize the total within-cluster sum of squares. In other words, the goal is to minimize the overall distance between the points in each cluster. At the end, a final model is trained and the result is plotted.
It is important to mention that K-Means can only be used for continuous variables, so country, region, and year are not considered here.
8.1 Introduction to K-Means
K-Means Clustering is an unsupervised machine learning method that can help classify data that has no label that a model could be trained on. This is made possible by a distance-based approach that fits the clusters so that the distances between the data points within a cluster are minimal. The distance in this case is a measure of similarity between data points, so in other words we want to fit clusters so that the points contained in one cluster are as similar as possible.
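A minimal sketch of this procedure in R, using the base function kmeans on simulated data, might look as follows. The two variables and their values are hypothetical stand-ins for the real explanatory variables, and the choice of nstart is an assumption.

```r
set.seed(123)

# Simulated data: two groups of 50 "countries" with two numeric variables
# (hypothetical values, not the WHR dataset).
sim <- data.frame(
  happiness = c(rnorm(50, mean = 4, sd = 0.5), rnorm(50, mean = 7, sd = 0.5)),
  gdp       = c(rnorm(50, mean = 8, sd = 0.5), rnorm(50, mean = 11, sd = 0.5))
)

# Standardize so that every variable has mean 0 and standard deviation 1;
# otherwise variables on larger scales would dominate the distances.
sim_scaled <- scale(sim)

# Fit K-means with two clusters; nstart repeats the random initialization
# and keeps the best solution.
fit <- kmeans(sim_scaled, centers = 2, nstart = 25)
print(table(fit$cluster))
print(fit$tot.withinss)
```

The value tot.withinss is the total within-cluster sum of squares, the quantity that K-means tries to make small for a given number of clusters.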
8.2 Using PCA for K-Means performance optimization
A good way to improve the performance of K-Means is to apply principal component analysis (PCA) to the data, as it can aggregate the information. However, PCA is only useful if around 80% of the total variance of the data can be covered with only a few principal components.
The PCA results show that four to five of the seven principal components would be needed to cover a sufficient amount of variance. Since this is not very helpful and each principal component contains a similar amount of variance, PCA is not applied.
8.3 Finding a good number of clusters
In general, the higher the number of clusters, the more similar the points within one cluster will be. However, the model also becomes messier and harder to interpret. Therefore, using the elbow criterion, the hyperparameter k, which equals the number of clusters, should be kept sensibly small so that we get an interpretable result.
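The check described here can be sketched with the base R function prcomp. The data below is simulated with seven largely independent variables, an assumption chosen so that the variance is spread evenly across components, and the 80% threshold follows the rule of thumb stated above.

```r
set.seed(7)

# Seven simulated, mostly independent variables (hypothetical data).
sim <- as.data.frame(matrix(rnorm(200 * 7), ncol = 7))

# PCA on centered and standardized variables.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, and its running total.
explained  <- pca$sdev^2 / sum(pca$sdev^2)
cumulative <- cumsum(explained)
print(round(cumulative, 2))

# Smallest number of components that covers at least 80% of the variance.
n_components <- which(cumulative >= 0.8)[1]
print(n_components)
```

When the number of components needed is close to the number of original variables, as in this simulated case, PCA offers little reduction and the original variables can be used directly.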
To find k, the total within sum of squares for two to ten clusters is calculated and plotted.
After k = 2 the decrease of within sum of squares is not as strong anymore. Therefore, the K-Means model is trained with two clusters.
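The elbow criterion can be sketched as follows: K-means is fitted for each candidate k and the total within-cluster sum of squares is recorded. The data is simulated with two well-separated groups, and nstart and iter.max are assumed settings rather than the ones used in the report.

```r
set.seed(99)

# Simulated, standardized data with two groups (hypothetical stand-in).
sim <- scale(rbind(
  matrix(rnorm(100 * 3, mean = 0), ncol = 3),
  matrix(rnorm(100 * 3, mean = 4), ncol = 3)
))

# Total within-cluster sum of squares for k = 2, ..., 10.
ks  <- 2:10
wss <- sapply(ks, function(k) {
  kmeans(sim, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting wss against ks, for example with plot(ks, wss, type = "b"), gives the elbow plot; the k after which the curve flattens is the candidate number of clusters.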
8.4 Resulting plots
When looking at the scatterplots, the first thing to note is that the happiness score shows the cleanest split between the clusters: one cluster contains the scaled happiness scores above 0 and the other those below 0. In other words, the clusters are strongly informed by whether the happiness score lies above or below the median.
Median of the happiness score: -0.0004473365
For further context, the median of the scaled happiness score is almost exactly zero, so a split at 0 is close to a split at the median. The initial plots have shown that the happiness score correlates with all variables plotted here except for generosity. We can see that all plots in the matrix show a similar pattern of a red cluster on the right and a black cluster on the left. The stronger the correlation between two variables, the less overlap the clusters appear to have. For GDP per capita and healthy life expectancy, for example, the scatterplot shows a positive linear relationship between the two variables and the overlap is very small compared to generosity’s interaction with freedom to make life choices, where most of the clusters overlap and no strong correlation exists.
9 Using Random Forest to rank feature importance
When training a random forest, many decision trees are fitted, each on a random sample of the data and with a random subset of the predictors considered at each split. The resulting plot provides two different measures of variable importance, that is, of how much a variable contributes to improving the model.
The left plot shows how much a variable contributes to decreasing the mean squared error (MSE) of the resulting model. The MSE measures the average squared difference between the true values and the predictions. The variables are ordered by how much the MSE increases when the information in the variable is removed, which randomForest does by randomly permuting its values. In this example, GDP per capita improves the model the most: when its information is missing, the MSE is a lot higher.
Node purity, on the other hand, represents how well the trees split the data into groups with similar target values. The more a variable reduces the impurity of the nodes it splits, summed over its splits and averaged over all trees, the more important the variable is. In this plot GDP per capita again ranks highest, which means that it provides a useful condition on which to split groups.
When predicting the happiness score, GDP per capita and healthy life expectancy are ranked the highest for both importance measures. This is consistent with the high correlations that were measured in the scatterplots between the happiness score and these two variables.
An important thing to keep in mind is that there is an element of randomness in these results. For different seeds, the top four variables are stable, but their importance values and their ranking can differ. It should also be noted that, with an MSE increase of around 20%, freedom to make life choices still has a strong impact, even though it ranks low compared to the other variables.
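The two importance measures can be reproduced on a small example with the randomForest package. The sketch below uses simulated data in which only two of three predictors drive the response. The variable names echo the report, but the values, the coefficients and the ntree setting are assumptions for illustration.

```r
library(randomForest)

set.seed(2024)
n <- 300

# Simulated predictors; only gdp and life_expectancy influence the response
# (hypothetical data, not the WHR dataset).
sim <- data.frame(
  gdp             = rnorm(n),
  life_expectancy = rnorm(n),
  generosity      = rnorm(n)
)
sim$happiness <- 2 * sim$gdp + 1 * sim$life_expectancy + rnorm(n, sd = 0.5)

# importance = TRUE is required to obtain the %IncMSE measure
# in addition to IncNodePurity.
rf <- randomForest(happiness ~ ., data = sim, ntree = 500, importance = TRUE)

imp <- importance(rf)  # columns: %IncMSE and IncNodePurity
print(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ])
```

Calling varImpPlot(rf) draws both measures side by side. Rerunning the code with a different seed changes the exact values, which illustrates the element of randomness mentioned above.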
10 Discussion
Outlook: The variables for positive and negative affect could be used to investigate their effect on the happiness score.
11 References
- World Happiness Report. About. Retrieved July 26, 2024, from https://worldhappiness.report/about/
- World Happiness Report. World Happiness Report Appendices & Data. Retrieved July 26, 2024, from https://worldhappiness.report/data/
- Wikipedia. (2024, July). Gallup, Inc. Retrieved July 27, 2024, from https://en.wikipedia.org/wiki/Gallup,_Inc.#Gallup_World_Poll
- World Happiness Report. (2024). Happiness of the younger, the older, and those in between. Retrieved July 27, 2024, from https://worldhappiness.report/ed/2024/happiness-of-the-younger-the-older-and-those-in-between/#ranking-of-happiness-2021-2023
- World Happiness Report. (2024). World Happiness Report 2024 Figure 2.1: Country Rankings by Life Evaluations in 2021-2023. Retrieved July 27, 2024, from https://public.tableau.com/app/profile/worldhappiness/viz/2024Draft/Figure2_1
- World Happiness Report. FAQ. Retrieved July 27, 2024, from https://worldhappiness.report/faq/
- World Happiness Report. (2024, March 12). Appendix 1: Statistical Appendix for Chapter 2 of World Happiness Report 2024. Retrieved July 26, 2024, from https://happiness-report.s3.amazonaws.com/2024/Ch2+Appendix.pdf
- Islam, S. (2023). World Happiness Report up to 2023. Retrieved July 1, 2024, from https://www.kaggle.com/datasets/sazidthe1/global-happiness-scores-and-factors